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ABSTRACT 

The  title  of  the  paper  will  be  justified  by  the  consideration  of  the  parallel  computation  of 
vector  norms  and  inner  products  in  floating-point  and  a proposed  new  form  of  computer 
arithmetic,  the  symmetric  level-index  system. 
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1.  Introduction. 

In  this  paper  we  discuss  the  computation  of  vector  p-norms  and  scalar  products  and  the 
implementation  of  algorithms  for  them  on  vector  and  parallel  computers.  For  the  most 
part,  we  concentrate  on  the  vectorization  of  algorithms  for  these  operations.  We  begin 
with  a brief  overview  of  three  approaches  to  the  p-norm  calculation  including  an 
extended  version  of  Blue’s  algorithm  for  the  euclidean  norm  of  a vector  [2].  These 
particular  vector  norms  are  attracting  increased  computational  interest  caused  in  part  by 
their  role  in  the  radial  basis  function  approach  to  bivariate  approximation;  see  [19]  for 
example.  The  analysis  of  finite  element  techniques  also  makes  use  of  these  norms. 

In  Section  2,  we  concentrate  on  the  relative  merits  of  the  various  approaches  for  a 
serial  machine.  Section  3 extends  the  discussion  to  vector  and  (briefly)  to  other  parallel 
architectures.  We  see  there  that  considerations  of  the  ease  of  vectorization  lead  to  very 
different  conclusions  as  to  which  is  the  algorithm  of  choice.  For  other  parallel 
architectures  the  decision  may  well  be  different  again. 

The  choice  and  performance  of  the  algorithm  for  the  computation  of  the  p-norm  is 
not  just  architecture-dependent  but  also  depends  on  the  arithmetic  which  is  to  be  used  and 
on  the  detailed  implementation  of  that  arithmetic.  In  order  to  fix  a framework  for  the 
discussion,  we  adopt  the  Brown  [3]  model  of  floating-point  arithmetic.  In  the 
computational  examples  we  use  the  IEEE  single  precision  standard  [IEEE]. 

The  IEEE  standard  facilitates  comparison  with  the  symmetric  level-index,  or  sli, 
system.  This  arithmetic,  which  is  an  extension  of  the  original  level-index  scheme  of 
Clenshaw  and  Olver  [6],  [7],  is  outlined  in  Section  4.  For  greater  detail  on  this  arithmetic 
and  its  possible  implementation  see  [6],  [7],  [9]  and  the  introductory  survey  [8].  As  long 
as  it  remains  the  case  that  sli  arithmetic  must  be  implemented  in  software,  it  is  difficult  to 
implement  a genuine  simulation  of  the  double-length  symmetric  level-index  scheme. 
Therefore,  we  consider  the  single-length  versions  of  both  arithmetics  for  the  purposes  of 
comparison.  (One  such  software  implementation  for  experimentation  on  PC’s  is  available 
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[21]  while  a Fortran  implementation  with  its  own  precompiler  is  under  development  by 
Lozier.) 

In  Section  5,  we  discuss  floating-point  error  analysis  and  the  analysis  of 
extensions  of  the  floating-point  system  as  well  as  error  analysis  for  sli  arithmetic.  The 
focus  of  the  discussion  is  the  computation  of  extended  sums.  This  operation  is  central  to 
any  of  the  more  complicated  operations  such  as  evaluation  of  scalar  products  or  vector 
norms. 

In  Section  6,  we  consider  some  of  the  sli  vector  algorithms  for  the  tasks  of 
evaluation  of  the  p-norm  and  scalar  products  and  see  that  these  would  be  immediately 
parallelizable  as  soon  as  a symmetric  level-index  arithmetic  processor  is  available. 
Section  7,  on  computational  experience  with  the  algorithms  under  discussion, 
demonstrates  the  advantages  claimed.  The  findings  are  summarized  in  the  final  section  of 
the  paper. 

2.  Algorithms  for  Vector  Norms 

The  vector  space  of  complex  n-tuples  x = (Xj,  x^, x^)  is  a Banach  space  under  the  norm 

llxll  = (llx.|P)'^  (2.1) 

where  p > 1 is  a fixed  real  number.  This  norm  is  called  the  p-norm.  In  the  special  case 
where  p = 2,  the  norm  derives  from  the  usual  inner  product 

( X,  y ) = I x.y.,  (2.2) 

so  that  II  X II2  = (x,  x)^"^.  With  this  2-norm,  the  vector  space  becomes  a Hilbert  space. 
The  special  importance  of  Hilbert  spaces  in  applications  of  mathematics  helps  account 
for  the  existence  of  robust  algorithms  for  the  2-norm  such  as  Blue’s  algorithm  [2]  and  for 
the  existence  of  library  software  in  such  collections  as  NAG,  IMSL,  SLATEC  and  many 
others.  Two  other  special  cases  - each  easier  to  compute  than  the  2-norm  are 
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n 


llxll 


(2.3) 


and 


II  X II 


lx.l. 


(2.4) 


Like  the  2-norm,  the  1-norm  and  the  o=-norm  are  commonly  found  in  software  libraries. 

Little  software  is  available  for  the  more  general  p-norms.  Yet  general  Banach 
spaces  are  used  widely  in  applied  mathematics.  For  example,  finite  element  analysis  is 
rooted  in  Banach  spaces  and  not  Hilbert  spaces  and  one  of  the  more  promising  techniques 
being  developed  for  multivariate  approximation  using  radial  basis  functions  relies 
specifically  on  the  computation  of  p-norms  [19].  Perhaps  the  p-norm  is  available  in  finite 
element  packages  but  these  are  not  readily  accessible  except  to  finite  element 
practitioners. 

We  are  interested  here  in  algorithms  for  computing  p-norms  on  serial  and  vector 
computers.  In  each  case,  we  want  the  algorithm  to  be  matched  to  the  architecture  so  as  to 
speed  the  execution.  Additionally  - and  perhaps  even  more  importantly  - we  seek  robust 
algorithms.  Thus  they  should  return  either  the  correct  answer  or  a clear  indication  of 
failure.  The  first,  and  conceptually  simplest,  algorithm  is  based  directly  on  the  definition. 
ALGORITHM  IS 


s,  = I X,  If 


Set 


Compute  Sj  = s._j  + I x^  f , 


i = 2,  3, ...,  n. 


Set  II  X II  = s*^^  . 

p n 


It  is  generally  accepted  that  floating-point  systems  of  computer  arithmetic  are 
usefully  characterized  by  four  parameters.  See,  for  example,  [3],  [4],  [5]  and  [12].  These 


parameters  are: 
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(3:  the  radix  or  base  (usually  2 or  16), 

t:  the  number  of  p-digits  in  the  significand, 

®min-  minimum  exponent,  and 

^max'  maximum  exponent. 

These  parameters  do  not  describe  an  arithmetic  processor  fully  since,  for  example,  the 
machine  (roundoff)  unit  for  arithmetic  operations  is  not  known  until  the  rounding 
algorithm  is  specified.  Brown’s  contribution  [3]  was  to  assign  the  parameters  of  a 
specific  arithmetic  processor  not  on  the  basis  of  their  static  number  representation,  but  on 
the  basis  of  a four-parameter  arithmetic  model  which  reflects  an  acceptable  level  of 
performance  of  the  actual  arithmetic.  Thus  any  of  t,  e^  and  e^^  may  be  reduced  to 
compensate  for  a processor  that  fails  to  meet  a stipulated  set  of  criteria. 

It  is  not  always  easy  to  determine  the  parameters  that  best  characterize  the 
acceptable  level  of  performance  of  a given  processor.  We  shall  assume,  however,  that 
this  determination  has  been  made,  and  that  we  have  B,  t,  e . and  e available  for  use. 
The  set  of  model  numbers  is  the  set  of  all  normalized  floating-point  numbers 

X = ± O.djd^  ...  d^  X P"  (2.5) 

with  the  integer  e satisfying  e^  < e < e^^^^  and  dj  ^ 0.  Zero  is  included  also.  The  model 
has  a maximum,  the  overflow  threshold, 

A = (1 -p-t)  p^*""  (2.6) 

and  the  set  of  positive  model  numbers  has  a minimum,  the  underflow  threshold, 

X = p^™n  (2.7) 

Obviously,  Algorithm  IS  is  susceptible  to  both  underflow  and  overflow  in  the 
calculation  of  s^.  If  overflow  occurs,  the  computation  cannot  proceed  within  the 
arithmetic  model.  Nevertheless,  the  final  value  of  II  x lip  (if  it  could  be  computed)  may  be 
well  below  the  overflow  threshold  and  therefore  steps  should  be  taken  to  avoid  overflow 
so  that  a usable  result  is  obtained.  A simple  scaling  appears  to  do  the  trick. 
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ALGORITHM  2S 

Set 

m = IxJ,  Sj  = IXj/mlP 

Compute 

Si  = s._j  -1-  Ix/mlP,  i = 2,  3, ...,  n 

Set 

II  X II  = m s^^P  . 

P n 

A robust  program  must  do  somewhat  more.  Firstly,  it  must  check  to  see  whether 


m = 0,  in  which  case  it  must  set  II  x lip  = 0 and  exit.  Secondly,  it  must  check  whether  the 
final  multiplication  will  cause  overflow  - a perhaps  unlikely,  but  still  possible  event  - in 
which  case  it  must  issue  an  appropriate  error  message  and  quit  or,  alternatively,  link  into 
an  error-recovery  system  designed  to  allow  the  programmer  an  opportunity  to  circumvent 
the  overflow.  Thirdly,  the  program  must  take  into  account  the  possibility  of  underflow  in 
the  calculation  of  lx.  / ml'’. 

We  note  that  Algorithm  2S  overcomes  the  vulnerability  to  overflow  of  the  simple 
Algorithm  IS  at  the  expense  of  a preliminary  pass  over  the  data  vector  to  determine  the 
maximal  element.  The  third  of  our  serial  algorithms,  devised  by  Blue  [2]  for  the  2-norm 
case  and  extended  here  to  the  general  p-norm,  avoids  this  cost.  To  simplify  the 
presentation,  we  denote  by  co  the  largest  model  number  satisfying 

CO  < min  (A,  (2.8) 

(cf  (2.6)  and  (2.7))  and  let 

a = and  M = fp!  (2.9) 

where  [ p 1 is  the  smallest  integer  not  less  than  p.  Define  2M  nonoverlapping  intervals  by 
= [(x™-\  a*"],  m = -M+\,  -M+2, . . . M (2.10) 

For  each  m,  is  a scale  factor  that  transforms  the  interval  into  Iq  by  mapping 
X a~^x.  By  definition,  p^*’  powers  of  numbers  in  this  interval  Iq  can  be  formed  in  the 
model  without  overflow  or  underflow.  The  basis  of  the  algorithm  is  the  use  of  a separate 
accumulator  for  each  interval  I . 

m 
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ALGORITHM  3S 


Initialize  Set 

m 

Loop  For  each  x.  9^  0 (i  = 
Compute 
Set 

and 

Ifx.  = 0,  sets®  = 

1 ’ m 


= 0,  m = -M+l, -M+2, M. 

1,  2, n): 

m.  = r log„  Ix.l1 


c(i) 

- + 

\^  Ja  ^1^ 

1 

1 

S® 

= s®') , 

m ^ m.. 

m 

m ’ 

i 

for  all  m. 

m’ 

Finalize 


Set 

N = max  { m : S®^  0 } , P = 

min  {m  : S®^  9^0} 

Set 

S = S®^  +S®yco  + ...  + S®^ 

Set 

II  X II  = a^S*^. 

As  was  the  case  with  Algorithm  2S,  a completely  robust  program  must  do  a little 
more.  Overflow  remains  possible  in  the  accumulators  and  the  final  summation  only  if  n = 
0(A),  which  is  highly  unlikely  unless  e^^^^  is  unusually  small.  Underflow  is  highly 
probable  in  forming  the  terms  of  the  final  summation.  This  could  result  in  a genuine  loss 
of  precision  if  underflows  are  replaced  by  zero  (abrupt  underflow)  but  good  protection 
against  such  loss  would  be  afforded  by  use  of  gradual  underflow.  To  achieve  similar 
protection  with  abrupt  underflow,  a more  complicated  version  of  the  final  decision 
process  of  Blue’s  algorithm  [2]  could  be  employed.  The  programming  required  to  render 
the  algorithm  robust  is  nontrivial;  this  point  may  be  regarded  as  the  central  conclusion  to 
be  drawn  from  Blue’s  paper  [2].  That  is, 

even  with  reasonably  good  information  about  the  floating-point 
arithmetic  it  is  still  not  easy  to  achieve  complete  robustness  even  for 
such  a conceptually  simple  calculation  as  a vector  norm. 

An  advantage  of  this  algorithm  over  Algorithm  2S  is  the  need  to  process  the  data 
vector  once  only  - although  that  processing  is  clearly  more  expensive.  A disadvantage  is 
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the  need  to  obtain  the  parameters  of  the  floating-point  model  from  the  computational 
environment  - a task  that  is  burdensome  but  necessary  when  the  program  is  to  be 
portable.  The  principal  additional  costs  in  Algorithm  3S  are  derived  from  the  need  for 
2M  accumulators  and  the  computation  of  m.  for  each  i. 

3.  Vectorized  Algorithms 

We  present  vectorized  versions  of  Algorithms  IS,  2S  and  3S.  First  we  describe,  as  a 
subsidiary  algorithm,  the  familiar  cascading  algorithm  for  vectorized  summation  to  form 


n 


(3.1) 


where  we  assume  n is  given  in  the  form 
n = 2*^  -I-  m 


(3.2) 


where  k and  m are  the  unique  nonnegative  integers  satisfying  (3.2)  with  m < 2*^. 


ALGORITHM  VS 


Initialize 


Set 


For  each  i = m-i-1,  m-i-2, ...,  n^ 


For  each  i = 1,  2, ...,  m 


Loop  For  each  j = 1,  2, ...,  k 


set  n.  = n._^  / 2 , 

set  t®  = i=l,2,  ...,n 


Finalize 


Set  S = 


If  the  add-and-store  operations  inside  the  loops  were  executed  entirely  in  parallel, 
this  complete  summation  would  require  just  k-i-1  such  operations.  That  is,  parallel 
execution  of  Algorithm  VS  requires  0(log2  n)  operations  as  opposed  to  the  0(n)  that  are 
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necessary  for  serial  execution.  Because  operations  on  a vector  processor  are  not 
completely  overlapped,  Algorithm  VS  needs  0(n)  operations  on  such  a machine  but  with 
a reduced  execution  time  per  operation.  The  saving  becomes  more  significant  as  the 
vector  length  increases. 

Algorithm  VS  employs  2*^  temporary  storage  locations  that  are  not  needed  in  a 
serial  algorithm.  It  is  characteristic  of  vector  (or  parallel)  algorithms  that  reduced 
execution  time  is  achieved  at  the  expense  of  additional  storage  requirements.  The 
initialization  phase  of  this  algorithm  compresses  the  original  vector  into  one  of  length  2^ 
in  a way  which  minimizes  the  arithmetic  operation  count  for  that  phase. 

The  principles  of  Algorithm  VS  extend  naturally  to  other  reduction  processes,  a 
term  that  is  sometimes  used  in  compiler  manuals  to  describe  mappings  from  vector  to 
scalar  quantities.  For  example  it  extends  to  the  computation  of  maxima  or  minima  of 
vectors.  The  operation  count  and  the  temporary  storage  requirements  are  unchanged, 
provided  that  the  operation  is  suitably  defined.  In  the  case  of  finding  the  maximum  of  a 
vector,  the  operation  is  finding  the  maximum  of  two  numbers  and  storing  the  result. 

Algorithm  VS  vectorizes  Algorithm  IS  by  simply  modifying  the  initialization 
step  to  use  the  components  IxJ^  and  the  finalization  step  to  compute  (t^^ 

ALGORITHM  IV 

Use  Algorithm  VS  modified  as  above  to  compute  !lxlL 

Algorithm  2S  can  be  similarly  vectorized. 

ALGORITHM  2V 

Use  Algorithm  VS,  suitably  modified,  to  compute  m = max  lx.l . 

Use  Algorithm  VS,  suitably  modified,  to  compute  lIxIL 

Each  of  the  Algorithms  VS,  IV  and  TV  is  easily  expressed  in  standard  Fortran 
and  a vectorizing  compiler  can  be  expected  to  vectorize  all  of  the  (inner)  loops.  We  turn 
now  to  the  possible  vectorization  of  Algorithm  3S.  The  initialization  step  vectorizes 
without  difficulty.  The  loop  presents  two  difficulties: 
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(i)  The  need  to  distinguish  the  cases  in  which  x.  = 0,  and 

(ii)  the  need  to  select  the  appropriate  accumulator  when  0. 

In  the  finalization  step,  the  vectorized  determination  of  N and  P is  not  readily  expressed 
in  standard  Fortran,  and  a choice  must  be  made  between  Algorithm  VS  and  Homer’s  rule 
for  evaluating  the  polynomial  expression. 

Let  us  consider  the  loop,  in  Algorithm  3S,  which  is  where  most  of  the  work  is 
done.  The  difficulties  (i)  and  (ii)  arise  from  the  need  to  maintain  an  uninterrupted  flow  of 
operands  to  the  vector  pipeline  of  the  floating-point  processor.  An  uninterrupted  flow  is 
possible  when  the  operands  are  spaced  regularly  in  the  computer  memory  and  all 
operands  undergo  exactly  the  same  sequence  of  arithmetic  operations.  These  conditions 
translate  into  restrictions  on  the  Fortran  statements  that  are  allowed  inside  a loop  in  order 
that  the  loop  be  vectorized  by  a Fortran  compiler. 

These  restrictions  are  not  the  same  for  different  Fortran  compilers.For  example 
one  reference  manual  [Cray]  states  that  "Loops  containing  a GO  TO,  IF,  CALL  or  I/O 
statement  are  not  vectorizable"  while  another  [Convex]  states  that  "DO  loops  containing 
nested  IF  statements  and  nonlinear  subscripts  (subscripts  whose  values  do  not  form 
arithmetic  progressions)  can  be  vectorized."  We  remark  that  the  code  for  the  loop  of 
Algorithm  3S  contains  both  IF  statements  and  nonlinear  subscripts.  The  difference 
between  the  statements  made  in  the  two  different  compiler  manuals  can  be  understood  in 
the  following  way. 

First,  we  remark  that  it  is  important  to  minimize  restrictions  on  inner-loop 
programming  because  they  exclude  important  algorithms.  In  recognition  of  this  fact, 
every  vectorizing  compiler  provides  nonstandard  capabilities  for  doing  so.  However  the 
price  of  these  nonstandard  capabilities  is  a loss  of  portability.  This  places  a burden  on 
programmers  and,  especially,  on  providers  of  software  libraries. 

Secondly,  we  remark  that,  conceptually  at  least,  a loop  containing  logical 
branches  can  be  separated  into  two  loops,  one  containing  the  logic  and  one  containing  the 
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arithmetic.  Some  compilers,  for  example  [CDC],  support  the  separation  of  arithmetic  and 
logic  in  a natural,  though  nonstandard,  way.  The  programmer  can  write  a loop  defining  a 
bit-vector  containing  one  bit  for  each  component  of  the  arithmetic  vector  operands.  The 
logic  loop  contains  logical  and  relational  operations  and  the  arithmetic  loop  refers  to  the 
previously  computed  bit-vector.  For  those  operations  where  the  appropriate  bit  is  set,  the 
arithmetic  result  is  stored  while  for  those  for  which  this  bit  is  not  set,  the  arithmetic  result 
is  not  stored.  In  this  latter  case,  any  floating-point  exceptions  which  may  arise  such  as 
underflow/overflow  or  invalid  arguments  to  functions  must  be  suppressed.  For  this 
separation  to  be  effective,  both  the  logic  loop  and  the  arithmetic  loop  must  be 
vectorizable. 

Finally,  we  observe  that  several  compilers  now  perform  the  separation  of  logic 
and  arithmetic  through  analysis  of  the  source  code,  at  least  in  sufficiently  simple  cases. 

The  following  is  one  plausible  vectorization  of  Algorithm  3S.  An  analog  of 
Algorithm  VS  is  used  to  perform  the  summations  subject  to  control  by  a bit- vector. 
ALGORITHM  3V 

Initialize 

Set  S^°^=  0,  m = -M-Hl,-M-H2,  ...,M. 

m 

[ r log„lx.l  1 

Set  m.  = ^ i = 1,  2, ...,  n. 

I 0 ifx.=0 

Loop 

For  each  m = — M-i-1,  -M-h2,  ...,  M: 

f 0 if  m m. 

Setb^"'^=  j ' i = l,2, ...,  n. 

[ 1 if  m = m. 

Use  Algorithm  VS  (controlled  by  b^*"^  and  appropriately 
initialized)  to  compute  S^^ . 

Finalize  as  in  Algorithm  3S. 
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A general  remark  is  in  order  on  the  efficiency  of  vectorizing  inner  loops  when 
some  of  the  loop  iterations  do  not  store  a result.  Such  iterations  take  the  same  time  as  do 
iterations  for  which  the  result  is  stored.  The  separation  approach  maintains  an 
uninterrupted  flow  in  the  vector  pipeline  but  clearly  the  effective  vector  rate  is  degraded. 
If  the  proportion  of  null  operations  is  large  enough,  a vector  loop  with  its  associated 
overhead  can  produce  an  effective  rate  slower  than  would  be  achieved  by  a scalar 
algorithm.  In  Algorithm  3V,  the  effective  vector  rate  is  eroded  rapidly  as  M increases. 
Therefore  this  algorithm  is  not  attractive  for  computing  11x11^  for  large  p. 

We  conclude  this  section  with  the  observation  that  an  extremely  simple  serial 
algorithm,  Algorithm  IS,  vectorizes  easily.  But  neither  Algorithm  IS  nor  IV  is  at  all 
robust,  i.e  resistant  to  overflow  and  underflow.  The  greater  robustness  of  Algorithms  2S 
and  3S  is,  of  course,  retained  by  their  vectorization.  But  it  is  probably  impossible  to 
vectorize  Algorithm  3S  in  a way  which  is  both  portable  and  efficient. 

4„  The  Symmetric  Level-Index  System 

In  this  section,  we  review  briefly  the  basic  definitions  and  properties  of  the  symmetric 
level-index  representation  and  its  arithmetic  and  go  on  to  the  summation  of  a vector  as  an 
sli  operation. 

4.1  Symmetric  Level-Index  Arithmetic 

A positive  number  X is  represented  in  the  level-index,  //,  system  by  x where 

X = (|)(x)  (4.1) 

and  the  generalized  exponential  function  (|)  is  defined  by 

f X 0<x  < 1, 

(1)  (X)  = (4.2) 

I exp  ((t)(x-l))  X > 1. 

The  representation  x is  written  in  the  form  / -i-  f where  the  level  I is  a nonnegative  integer 
and  the  index  f e [0,  1).  The  various  quantities  are  related  by  the  equation 
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X = exp(  exp( ...  (exp  f) ... ))  (4.3) 

where  the  exponential  function  is  applied  I times. 

For  the  symmetric  level-index  system,  numbers  in  the  interval  (0,  1)  are 
represented  by  the  li  images  of  their  reciprocals  together  with  an  indicator  to  show  that 
they  are  in  reciprocal  form.  Thus  the  sli  representation  can  be  described  as  representing  a 
real  number  X by  ± <})(x)~\  (There  are  other  ways  to  describe  this  representation  which 
are  sometimes  convenient  for  the  analytic  development  of  the  theory.)  The  arithmetic 
algorithms  for  the  li  and  sli  systems  are  described  in  detail  in  [7]  and  [9],  while  possible 
hardware  implementation  schemes  are  discussed  in  [20]  and  [18]. 

A minimal  description  of  the  arithmetic  algorithms  is  necessary  in  order  that  some 
of  the  available  simplicity  of  the  vector  operations  can  be  clearly  demonstrated.  Consider 
the  addition  (or  subtraction)  of  two  numbers  (t)(x)  and  with  x > y,  to  form  their  sum 
(or  difference)  ({)(z).  The  problem  is  to  find  z,  the  sli  representation  of  this  sum  or 
difference;  that  is,  we  seek  z such  that 

(})(z)  = (l)(x)  ± <))(y)  (4.4) 

This  is  achieved  via  the  computation  of  the  members  of  three  short  sequences  defined  by 
a-  = 1 1 (|)(x-j),  b.  = (l)(y-j)  / (l)(x-j),  c.  = ({)(z-j)  / (t)(x-j).  (4.5) 

These  sequences  are  computed  from  appropriate  starting  values  by  simple  recurrence 
relations  involving  evaluation  of  exponential  or  logarithmic  functions  for  special  and 
restricted  ranges  of  their  arguments.  Slight  variations  of  these  sequences  are  needed  for 
some  of  the  other  arithmetic  operations  involving  quantities  in  reciprocal  form,  but  the 
principle  is  similarly  straightforward.  The  details  of  the  computation  are  not  important  to 
the  present  discussion.  For  a detailed  description  of  these  algorithms  and  possible 
implementations  see,  for  example,  [8]. 

The  important  point  to  make  about  the  arithmetic  algorithms  at  this  point  is  that 
all  the  internal  computation  is  performed  in  fixed-point  fixed-precision  form.  It  is  this 
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fact  which  makes  considerable  economy-of-scale  available  in  the  implementation  of  the 
extended  arithmetic  operations  to  be  discussed  shortly. 

For  our  present  purposes,  it  is  not  the  details  of  the  algorithm  that  are  essential, 
but  more  the  properties  of  the  sli  system  once  implemented. 

One  of  the  most  important  properties  of  the  sli  system  is  closure.  Let  A(t,/)  be  the 
(finite)  set  of  all  sli  numbers  with  t-bit  indices  and  levels  not  greater  than  /.  In  [15]  it  is 
proved  that  A(t,/)  is  closed  under  the  four  basic  arithmetic  operations,  excluding  division 
by  zero,  if  / is  large  enough.  For  example  A(27,7)  is  closed  and  requires  only  32  bits  to 
represent  all  of  its  members  (27  for  the  index,  3 for  the  level  and  one  each  for  the  sign 
and  the  reciprocation  indicator).  In  contrast  to  the  IEEE  standard  P754  [IEEE],  it  is  not 
necessary  to  introduce  an  artificial  infinity  arithmetic  with  non-numerical  infinity 
symbols.  The  error  measure  for  symmetric  level-index  arithmetic  is  different  from  that 
for  floating-point.  The  appropriate  error  measure  for  the  sli  system  is  generalized 
precision  which  is  developed  and  discussed  in  [6];  it  corresponds  to  fixed  absolute 
precision  in  the  index. 

The  importance  of  closure  is  that  it  renders  the  system  entirely  free  from  overflow 
or  underflow  as  a result  of  arithmetic  operations.  As  we  shall  also  observe  later, 
generalized  precision  is  the  appropriate  measure  for  the  sort  of  calculation  of  current 
concern.  It  is  precisely  the  right  measure  to  use  when  comparing  very  large  or  very  small 
quantities  which  are  themselves  to  be  the  arguments  of  a high-order  root  function  so  that 
we  can  draw  appropriate  conclusions  about  the  accuracy  of  the  final  answer.  This 
particular  point  is  also  borne  out  by  the  computational  experience  with  the  sli  system 
reported  in  [10]  on  the  root-squaring  process  for  polynomial  root-finding  and  [21]  where 
the  p-norm  calculation  was  seen  to  be  highly  stable  as  p increased. 
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4.2  Summation  as  an  sli  operation 

In  following  sections,  we  shall  be  describing  the  sli  implementation  of  algorithms  for 
computing  scalar  products  and  p-norms  of  vectors.  It  is  immediately  apparent  that  both  of 
these  operations  must  rely  heavily  on  the  summation  of  the  components  of  a vector.  We 
conclude  this  section  with  a look  at  this  operation  which  is  the  subject  of  the  error 
analysis  discussion  in  the  next  section.  The  possibility  of  the  efficient  computation  of 
extended  sums  for  the  level-index  system  was  first  considered  in  [7],  a software 
implementation  is  discussed  in  detail  in  [21]  and  the  error  analysis  of  extended  arithmetic 
operations  in  the  li  and  sli  systems  is  discussed  in  [17]. 

Firstly,  the  implementation  of  the  operation  even  for  a serial  machine  becomes 
highly  efficient  once  the  largest  term  in  the  sum,  x^^  say,  has  been  identified.  In  this  case 
only  one  sequence  (aj]  and  only  one  [Cj]  need  be  computed.  The  whole  of  the  extended 
nature  of  the  operation  is  accounted  for  by  a redefinition  of  Cq  which  depends  on  values 
of  bp  (or  its  equivalent  for  quantities  in  reciprocal  form)  for  the  terms  to  be  summed.  That 
is,  for  the  simplest  case  where  all  terms  are  greater  than  unity,  we  set 


(4.6) 


where  the  summation  is  taken  over  all  terms  except  for  the  largest  and 


bo,  = 4’(x,)/<l>(x„„) 


(4.7) 


This  redefinition  of  Cq  simply  amounts  to  the  summation  of  these  fixed-point,  fixed- 
precision  quantities  bg^  - an  operation  which  is  easily  achieved  since  there  are  no 
alignment  shifts  or  normalizations  to  account  for.  The  savings  thus  achieved  amount  to 
about  two-thirds  of  the  work  which  might  otherwise  be  involved.  Details  of  the 
algorithmic  aspects  of  this  can  be  found  in  [21]. 

On  a machine  with  a sufficient  degree  of  parallelism  available  in  its  sli  processor, 
it  is  clear  that  all  the  sequences  (bj}  could  be  computed  simultaneously  and  so  the 
complete  operation  would  be  slowed  in  comparison  to  a single  addition  only  to  the  extent 
of  identifying  x^^  and  then  computing  the  redefined  Cq  via  a tree  of  adders.  Similarly,  for 
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a software  implementation  on  a vector  machine,  the  computation  of  the  various  quantities 
bg^  is  readily  vectorizable  as  in  Algorithm  VS. 

5.  Error  Analysis 

In  this  section  we  discuss  briefly  the  error  analysis  of  the  IEEE  standard  floating-point 
system  with  reference  to  the  effect  of  gradual  underflow  and  the  consequent  loss  of  a 
uniform  relative  error  bound  for  arithmetic  operations.  We  shall  also  consider  how  this 
affects  other  possible  floating-point-like  arithmetics  which  have  been  proposed.  The 
discussion  will  center  on  the  summation  operation. 

We  shall  also  include  a short  discussion  of  the  error  analysis  for  the  sli  operation 
of  forming  sums  either  by  repeated  addition  or  using  the  summation  algorithm  described 
briefly  in  the  last  section.  The  implementation  described  in  [21]  incorporates  this 
operation  along  with  the  formation  of  scalar  products  and  vector  norms  in  its  standard 
library  of  functions  and  procedures.  The  vector  sum  operation  is  central  to  all  of  these. 

5.1  Floating-point 

In  his  lengthy  paper  [11],  Demmel  highlights  some  of  the  advantages,  in  terms  of  writing 
robust  programs,  which  are  derived  from  the  inclusion  of  gradual  underflow  in  the 
floating-point  arithmetic.  Many  of  his  examples  are  drawn  from  linear  algebra 
applications  including  the  formation  of  scalar  products.  The  case  Demmel  puts  for  the 
inclusion  of  gradual  underflow  is  convincing  and  justifies  the  extra  complication  in  the 
floating-point  error  analysis  which  is  necessitated  by  the  loss  of  normalization  at  the 
underflow  threshold  X given  by  (2.7).  One  of  his  principal  arguments  in  favor  of  gradual 
underflow  is  the  consequent  relative  ease  of  writing  "highly  robust,  expert  codes  for 
problems  like  polynomial  root-finding";  the  basic  belief  being,  apparently,  that  it  is 
preferable  to  ease  the  programming  task  at  the  expense  of  slightly  more  complicated 
error  analyses.  This  is  a view  which  probably  meets  with  very  widespread  approval  and 
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acceptance  - and  certainly  that  of  the  present  authors. 

The  source  of  the  additional  difficulty  in  the  error  analysis  is  that,  while  relative 
error  in  the  floating-point  representation  is  bounded  by  the  machine  unit  e for  quantities 
in  excess  of  the  underflow  threshold  X,  for  smaller  quantities  the  absolute  bound  Xe  must 
be  used.  For  a t-digit  base  P significand  with  symmetric  rounding 
e = p'-72. 

It  is  apparent  from  [11]  that  the  detailed  analysis  of  algorithms  becomes  more  intricate  - 
even  though  the  final  error  bounds  achieved,  as  is  the  case  for  summation,  are  often  no 
more  complex  than  for  fully  normalized  arithmetic. 

Specifically  for  the  sum  of  N+l  floating-point  numbers  Xq,  Xj,  ...,  Xj^  of  the  same 
sign  which  are  assumed  to  be  exact,  we  see  from  [11]  that  the  final  rounding  error  is 
bounded  as  follows: 

I F1(I  X^)  - I xJ  < Ne  F1(I  X.)  / (1  - e)  (5.1) 

where  Fl(-)  stands  for  the  result  of  the  floating-point  operation,  so  that  the  relative 
rounding  error  is  bounded  by  (approximately)  N times  the  bound  for  a single  addition.  In 
the  case  of  other  extended  calculations  such  as  scalar  products,  the  error  bounds  are  a 
more  complicated  combination  of  relative  and  absolute  bounds  depending  on  the 
occurrence  of  (gradual)  underflow.  For  such  calculations  with  gradual  underflow,  the 
attractive  feature  of  a fixed  relative  error  bound  for  all  floating-point  arithmetic 
operations  is  lost. 

Other  unnormalized  floating-point  arithmetics  have  been  proposed  for  purposes 
of  VLSI  architectures  and  bit-by-bit  pipelining  of  arithmetic.  The  details  of  these 
arithmetic  systems  are  not  our  present  concern,  but  it  is  worth  saying  a little  about  the 
error  analysis  requirements  of  such  systems.  The  additional  complication  - in  relative 
error  terms  - is  much  greater  for  these  general  unnormalized  systems  than  for  the  special 
case  of  gradual  underflow.  It  has  been  studied  in  some  detail  by  Barlow  [1]  who  pays 
particular  attention  to  the  error  analysis  of  Gaussian  elimination.  The  essential  difference 
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between  his  analysis  and  the  analysis  of  conventional  floating-point  arithmetic  lies  in  the 
fact  that  relative  error  ceases  to  be  the  appropriate  measure.  It  is  replaced  by  "fractional 
error"  which  is  (essentially)  the  absolute  error  in  the  (unnormalized)  mantissa.  Like  the 
inclusion  of  gradual  underflow,  the  use  of  such  unnormalized  arithmetic  demands  the  use 
of  other  error  measures  as  well  as  relative  error. 

Gradual  underflow  (and  any  other  unnormalized  system)  has  the  effect  of 
extending  the  range  of  representable  numbers  close  to  zero.  Such  expedients  do  nothing 
to  alleviate  the  potential  dangers  at  the  other  end  of  the  range.  Matsui  and  Iri  [16]  and 
Hamada  [13]  have  suggested  extensions  of  the  floating-point  system  to  alleviate  the 
overflow  problem  by  allocating  variable- sized  segments  of  the  computer  word  to  the 
exponent  and  mantissa  of  floating-point  representations.  The  complication  of  the  relative 
error  analysis  of  such  systems  is  even  greater  than  those  mentioned  above.  Very  little  of 
this  analysis  exists. 

5.2  Symmetric  level-index 

For  symmetric  level-index  arithmetic,  too,  it  is  necessary  to  use  a different  error  measure, 
generalized  precision  [6].  Generalized  precision  has  some  significant  advantages 
compared  to  relative  error,  not  least  of  which  is  that  it  is  a metric  so  that  the  symmetry  of 
X approximating  x and  x approximating  x is  a natural  aspect  of  the  error  analysis. 
Detailed  error  analyses  of  numerical  processes  will  inevitably  be  different  in  this  system 
than  for  any  of  the  floating-point  systems  but  significant  progress  has  already  been  made 
in  this  respect.  (See,  for  example,  [6],  [8]  and  [17].) 

We  turn  now  to  questions  of  the  precision  that  can  be  achieved  in  extended 
calculations.  Olver  [17]  has  demonstrated  that,  at  relatively  low  cost,  it  is  possible  to 
perform  a Wilkinson-style  running  error  analysis.  Such  a running  analysis  is  particularly 
well- suited  to  a parallel  environment  since  it  would  be  performed  by  simultaneous 
duplication  of  the  operations  for  slightly  adjusted  data.  The  adjustments  are  similar  to  the 
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use  of  directed  rounding  in  interval  arithmetic  and  have  a similar  effect  in  yielding 
guaranteed  error  bounds  for  the  results  obtained. 

A first-order  error  analysis  for  extended  sums  and  products  leads  to  conclusions 
that  are  broadly  similar  to  those  for  the  floating-point  system  in  that  the  generalized 
precision  of  the  final  result  is  bounded  by  N times  the  generalized  precision  for 
individual  operations.  If  5 is  the  generalized  precision  of  the  sli  representation,  so  that 
8 = p'-/! 

for  symmetric  rounding  with  a t-digit  base  (3  index,  and  taking  all  the  terms  in  the  sum  to 
have  the  same  sign,  we  find  that  the  final  bound  for  the  error  5x  in  the  sli  representation  x 
of  the  extended  sum 

(t)(x)  = (j)(XQ)  -h  (t)(Xj)  + . . . -I-  (j)(Xj^) 

is  given  by 

I 5x  I < N 5.  (5.2) 

Compare  (5.1).  Similar  conclusions  apply  to  extended  products. 

Examples  of  such  computations  are  presented  in  [21].  They  demonstrate  that, 
even  in  cases  where  underflow  or  overflow  are  not  significant  problems,  sli  arithmetic 
provided  answers  of  similar  or  greater  relative  precision  than  their  floating-point 
counterpart. 

We  turn  now  to  the  error  analysis  of  the  direct  extended  sum  algorithm  described 
in  the  previous  section.  There  is  not  the  space  to  include  a fully  detailed  analysis  at  this 
point.  These  details  will  be  included  in  a forthcoming  paper.  We  confine  ourselves  to  a 
brief  outline  of  the  analysis  in  the  simplest  case  and  the  comment  that  the  other  cases  can 
be  handled  by  comparably  straightforward  extensions  of  the  error  analyses  in  [7]  and  [9]. 

The  important  finding  is  that  the  extended  summation  algorithm  can  reduce  the 
roundoff  error  significantly  by  comparison  with  (5.2)  above.  Indeed  the  error  6x  above 
can  be  bounded  by  (N-i- 1)5/2  which  is  only  about  half  the  error  committed  by  repeated 
addition.  Furthermore,  it  is  apparent  that  by  making  the  appropriate  choices  of  working 
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precisions  an  extended  summation  processor  could  be  designed  to  reduce  the  error  here 
to  the  same  magnitude  as  that  for  a single  operation.  The  price  of  this  would  be  that  all 
internal  calculation  is  computed  to  approximately  log2  K extra  bits  of  precision  where  K 
is  the  maximum  vector-dimension  available. 

6.  Parallelization  in  an  Sli  Environment 

In  Sections  2 and  3,  we  introduced  possible  algorithms  for  the  computation  of  p-norms 
and  studied  their  relative  efficiencies  in  various  computer  architectures.  In  this  section  we 
reexamine  those  algorithms  from  the  point  of  view  of  efficiency  in  parallel  architectures 
with  a built-in  sli  arithmetic  processor.  Of  course,  no  such  machine  exists  at  present 
but  one  of  our  conclusions  is  that  it  should  for  the  reasons  that  will  become  apparent. 

Let  us  first  recall  the  basic  definition  of  the  p-norm  of  a vector  x,  namely 

II  X lip  = { IXj|P  + Ix^lP  . . . -H  \xf  } (6.1) 

The  first  observation  to  make  is  that  because  of  the  freedom  from  overflow  and 
underflow  afforded  by  the  sli  system,  the  algorithm  implied  by  the  definition  is  a 
perfectly  feasible  method  for  the  computation  and  indeed  it  may  appear  to  be  the  optimal 
such  algorithm.  (We  discuss  a still  more  efficient  algorithm  later.)  This  would  be 
summarized  by  the  Pascal-like  procedure: 

for  i :=  1 to  n do  u.  :=  lx.P ; 
pnorm  :=  Root(SumVector(u),p); 

The  operation  within  the  loop  is  clearly  immediately  parallelizable  for  any  parallel 
architecture  and,  since  the  formation  of  the  powers  Ix.l^  is  a very  simple  operation  for  sli 
arithmetic  (as  is  the  taking  of  the  p*  root),  this  overall  operation  is  particularly  well- 
suited  to  any  parallel  environment  which  is  equipped  with  a symmetric  level-index 
arithmetic  processor.  This  is  especially  true  in  the  case  where  the  sli  processor  itself  has  a 
high  degree  of  parallelism  so  that  the  SumVector  operation  can  benefit  from  all  the 
acceleration  described  in  the  previous  section. 
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In  the  event  that  such  parallelism  is  not  available,  then  the  summation  is 
immediately  amenable  to  pipelining  or  to  some  tree-structured  addition  procedure  for  a 
multiprocessor  architecture.  Of  course,  for  either  of  these  latter  operations,  similar 
considerations  as  for  the  floating-point  system  must  be  made  with  regard  to  organization 
of  the  data  in  order  to  maximize  the  overall  precision  of  the  final  result. 

At  this  stage  then,  it  would  appear  that  the  sli  system  is  ideally  suited  to  parallel 
computing  environments  and  that  much  of  the  expected  slowdown  of  individual 
calculations  in  comparison  with  floating-point  will  be  handsomely  compensated  in  such 
situations.  However,  we  have  barely  scratched  the  surface  of  the  available  savings. 

Consider  again  the  situation  of  a symmetric  level-index  arithmetic  processor  with 
a high-level  of  on-chip  parallelism.  In  this  situation,  we  find  that  the  operation  of 
computing  vector  norms  could  sensibly  be  incorporated  as  a single  built-in  vector 
function  which  could  be  computed  in  not  much  more  than  a single  sli  operation  time. 

The  algorithm  is  based  on  a similar  approach  to  that  outlined  above  for 
summation;  it  entails  a simple  redefinition  of  Cq,  the  starting  point  of  the  sequence  {cp. 
Just  as  with  summation,  a single  sequence  (aj)  suffices  for  the  whole  operation,  while  the 
several  sequences  {bj)  can  be  computed  simultaneously. 

The  appropriate  value  for  Cq  is  given  by 

c„  = {Zbg,  }■'>’;  (6.2) 

compare  (4.6).  At  first  sight  this  looks  almost  as  complicated  as  the  original  operation, 
but  this  first  impression  is  misleading.  Since  bp^  is  computed  by  an  evaluation  of  the 
exponential  function,  b^^  can  be  obtained  simply  by  multiplying  the  argument  by  p. 
(Again  this  is  a fixed  absolute  precision  operation.)  Similarly,  the  formation  of  the  p* 
root  in  (6.2)  is  simply  incorporated  into  the  computation  of 

Cj  = 1 -I-  aj  In  Cq  (6.3) 

by  the  fixed-precision  division  of  the  logarithm  of  Z bg.  by  p.  The  only  time  penalties 
incurred  here  relative  to  a single  sli  operation  are  in  identifying  the  largest  element  of  the 
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vector  and  in  performing  the  extended  fixed-point  summation  using  an  efficient  tree- 
structure. 

The  location  of  the  dominant  element  - or  perhaps  a more  complex  sorting 
operation  - is  required  of  all  the  algorithms  under  serious  consideration.  In  the  case  of  the 
symmetric  level-index  representation,  this  operation  need  be  no  more  complicated  than 
for  integer  variables  since  the  bit-patterns  used  for  the  representation  can  be  organized  to 
preserve  the  natural  integer  ordering.  This  fact  is  utilized  in  the  implementation  described 
in  [21]  in  which  vector  norms  are  computed  using  the  algorithm  just  outlined  - in  a serial 
implementation. 

We  see  here  that  the  symmetric  level-index  system  allows  the  straightforward 
definition  of  the  vector  norms,  Algorithm  IS,  to  be  used  for  their  computation  and  that 
this  permits  immediate  vectorization  or  high-degree  parallelization  for  any 
supercomputer  architecture.  In  the  event  that  this  architecture  allows  significant 
parallelism  within  the  sli  processor  the  operation  can  be  made  especially  efficient. 

At  this  point  it  is  natural  to  investigate  the  more  general  task  of  computing  scalar 
products  of  two  vectors.  (See  [11],  Section  7 for  a simple  example  which  illustrates  the 
difficulty  in  producing  robust  floating-point  algorithms  for  this  task.)  This  operation  has 
been  one  of  the  main  planks  on  which  the  interval  arithmetic  packages  such  as  that 
described  in  [14]  have  been  built. 

One  likely  suggestion  for  the  computation  of  the  inner  product  (x,  y)  of  the  two 
vectors  x and  y in  a floating-point  environment  would  be  that  each  vector  should  be 
scaled  by  its  largest  element  and  the  final  result  scaled  in  a compensatory  manner. 
However,  as  we  see  in  the  next  section,  this  may  be  totally  inappropriate  since  these 
elements  may  themselves  contribute  to  almost  insignificant  terms  of  the  inner  product. 
Perhaps  any  scaling  should  be  with  respect  to  the  largest  term  of  the  inner  product  itself 
but  this  would  add  a generally  unacceptable  burden  to  the  whole  procedure. 
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It  is  perfectly  conceivable  to  achieve  some  similar  effect  by  considering  the 
scaling  as  an  exponent  shift  where  the  largest  sum  of  exponents  for  the  terms  x.y.  would 
be  used.  However  this  requires  a considerably  more  complicated  sorting  procedure  to  be 
adopted  and  then  a decision  as  to  how  the  overall  shift  should  be  divided  between  the  two 
vectors.  The  additional  start-up  time  for  any  vectorized  algorithm  utilizing  any  of  these 
approaches  would  be  prohibitive  in  almost  all  cases. 

The  corresponding  scalar  product  algorithm  for  the  symmetric  level-index 
computing  environment  is  described  by  the  following  PASCAL-like  code 
for  i :=  1 to  n do  w.  :=  x.*y.; 

ScalarProd  :=  SumVector(w); 

that  is,  by  a straightforward  application  of  the  definition.  As  we  have  already  seen,  the 
SumVector  operation  is  well-suited  to  parallel  implementation,  and  the  loop  is  clearly 
immediately  vectorizable  or  adaptable  to  any  parallel  environment. 

7.  Computational  Examples 

Among  the  algorithms  discussed  in  Section  2,  Algorithm  3S  is  the  algorithm  of  choice 
for  complete  robustness  in  a serial  floating-point  environment.  In  a vector  processor 
environmment  Algorithm  2V  becomes  preferable  due  to  its  relative  ease  of  vectorization 
and  normally  adequate  robustness.  If  a robust  arithmetic  such  as  symmetric  level-index  is 
used  then  the  simplest  algorithm.  Algorithm  IS  (or  IV),  is  the  obvious  choice.  As  a first 
example,  we  compare  the  results  of  computing  llxllj,  IIxIIjq,  HxIIjqq  and  HxIIjq^q  for  vectors 
of  length  10,  100  and  1000  using  Algorithm  IS  in  32-bit  sli  arithmetic  and  both 
Algorithms  2S  and  3S  in  IEEE  standard  (single  precision)  arithmetic  with  abrupt  rather 
than  gradual  underflows. 

The  sli  computation  was  performed  using  the  Turbo  PASCAL  unit,  SLIUNIT, 
which  was  described  in  [21].  This  unit  implements  all  the  standard  arithmetic  operations, 
elementary  functions  and  the  extended  operations  such  as  computation  of  p-norms  and 
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scalar  products  of  present  interest.  Table  1 presents  the  sli  results  and  the  floating-point 
results  for  Algorithm  2S.  The  error  estimates  were  made  by  comparison  with  extended 
precision  floating-point  computations. 


Table  1 

—8 

Relative  error  in  llxllp  for  vectors  x = (1,  2, ...,  n)  measured  in  units  of  10  . 
(Upper  entries  for  floating-point,  lower  entries  for  sli.) 


P 

n 

10 


100 


1000 


1 

0 

_0 

10 

_0 

6 

12 


10  100  1000 

3 2 0 

_3 2 0 

2 1 4 

_6 1 4 

2 2 2 

2 2 2 


For  n = 10,  both  the  floating-point  and  symmetric  level-index  computations 
returned  55  exactly  for  llxllj  and  10  exactly  for  HxIIjqqq.  Since  IIxIIjqqq=  10(1  -1-0(10"^®)) 
this  accounts  for  the  lack  of  error  in  this  case.  In  general,  the  relative  error  is  quite  flat 
throughout  the  entire  test  for  both  arithmetics. 

Although  many  (abrupt)  underflows  were  reported  in  the  floating-point  programs 
for  p = 100  and  p = 1000,  the  error  in  these  results  does  not  reflect  any  undue  loss  of 
precision.  We  ran  the  same  tests  using  Algorithm  3S.  The  errors  observed  were  not  much 
different  - in  some  cases  slightly  smaller  and  in  others  slightly  larger.  No  underflows 
were  reported  by  Algorithm  3S. 

The  error  analysis  of  Section  5 applies  directly  to  the  cases  in  the  first  column  of 
Table  1.  Let  us  examine  the  case  n = 1000,  p = 1.  Using  (5.1)  with  N = 999  and  e = 2"^^ 
we  find  that  the  relative  error  should  be  bounded  by 
Ne/(1  -e)  - 6x  10"^ 
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The  floating-point  error  shown  in  Table  1 is  1000  times  smaller,  even  though  scalings  in 
the  algorithm  are  not  accounted  for  in  the  bound  (5.1).  Similarly,  using  the  bound  (5.2)  - 
which  is  the  appropriate  one  for  the  algorithm  used  here  - with  N = 999  and  5 = the 
generalized  precision  of  1 1x1 1 j should  be  bounded  by 
N 5 = 7.4  X 10“^, 

which  corresponds  to  a relative  error  of  approximately  2.5  x lO"'*.  The  relative  error  of 
the  sli  result  shown  is  about  2000  times  smaller. 

As  a second  example,  we  consider  the  computation  of  a scalar  product. 
Specifically,  consider  the  vectors  u,  v whose  components  are  given  by: 
u.  = uf_j  (i  = 1,2,...,6);  Ug  given  and 

Vg  = - Ug,  V,  = - Ug,  Vg  = Uj,  Vg  = Ug,  V-  = U- , otheHvise. 

The  scalar  product  of  these  two  vectors  was  computed  both  in  sli  arithmetic  and  using 
floating-point  with  scaling  by  the  largest  element  as  described  in  the  previous  section. 
(This  is  also  the  scaling  used  for  Algorithm  2S.)  For  the  first  and  simplest  case  with  Ug  = 
1,  both  systems,  of  course,  produced  the  correct  answer.  However  on  setting  Ug  = 2,  so 
that  u.  = 2'^(2‘),  the  floating-point  computation  yielded  u\  = 0 because  the  scaling  by 

the  largest  elements  in  the  arrays  (which  in  this  case  are  both  2^)  results  in  all  the  middle 

terms  in  the  scalar  product  underflowing  to  zero,  while  the  terms  UgVg  and  UjVj  are 
exacdy  cancelled  by  the  last  two  terms. 

The  sli  result  here  is  4.29497  E-i-09.  The  exact  scalar  product  is  4 295  033  088  so 
that  the  relative  error  in  the  final  result  is  less  than  1.5x10”^. 
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8.  Conclusions 

The  principal  conclusions  to  be  drawn  from  this  work  are  as  follows. 

1 . The  simplest  algorithms  for  computing  p-norms  or  scalar  products  of  vectors  are 
readily  vectorizable  but  not  robust. 

2.  Completely  robust  accurate  algorithms  for  the  p-norm  are,  at  best,  very  difficult  to 
vectorize  efficiently. 

3.  Algorithm  2V  represents  a normally  acceptable  compromise  between  the  goals  of 
robustness  and  vectorization. 

4.  Much  of  the  difficulty  in  obtaining  parallel  algorithms  that  are  completely  robust 
in  floating-point  arithmetic  for  these  vector  operations  is  caused  by  the  need  to 
safeguard  against  overflow  and  (potentially  harmful)  underflow  and  to  preserve 
accuracy  in  the  computed  result. 

5.  The  symmetric  level-index  arithmetic  system  alleviates  all  of  these  difficulties 
because,  within  that  system,  the  simplest  algorithms  are  robust,  portable,  accurate 
and  immediately  parallelizable  owing  to  the  provision  of  a (completely)  robust 
arithmetic. 

The  computation  of  p-norms  and  scalar  products  are  merely  illustrations  of  a 

more  general  pattern  in  scientific  computing  applications  where,  often,  considerations  of 

robustness  and  the  desire  for  parallelization  of  floating-point  algorithms  conflict.  For  all 

of  these  reasons  we  draw  the  final  conclusion 

6.  The  more  "super"  the  computer,  the  greater  the  need  for  "super"  arithmetic  such 
as  the  symmetric  level-index  system. 


Supercomputers  need  super  arithmetic 


26 


REFERENCES 

[1]  J.L.Barlow,  Error  analysis  in  unnormalized  floating  point  arithmetic.  Report  CS- 
88-10,  April  1988,  Dept  Computer  Science,  Pennsylvania  State  University. 

[2]  J.L.Blue,  A portable  Fortran  program  to  find  the  euclidean  norm  of  a vector, 
ACM  Trans  Math  Software  4 (1978)  15-23. 

[3]  W.S. Brown,  A realistic  model  of  floating-point  computation.  Mathematical 
Software  in  (J.R.Rice,  Ed.)  Academic  Press,  New  York,  1977,  pp  343-360. 

[4]  W.S. Brown,  A simple  but  realistic  model  of  floating-point  computation,  ACM 
Trans  Math  Software  7 (1981)  445-480. 

[5]  W.S. Brown  and  S.I.Feldman,  Environment  parameters  and  basic  functions  for 
floating-point  computation,  ACM  Trans  Math  Software,  6 (1980)  510-523. 

[6]  C.W.Clenshaw  and  F.W.J.Olver,  Beyond  floating  point,  J.  ACM  31  (1984)  319- 
328. 

[7]  C.W.Clenshaw  and  F.W.J.Olver,  Level-index  arithmetic  operations,  SIAM  J Num 
Anal  24(1987)  470-485. 

[8]  C.W.Clenshaw,  F.W.J.Olver  and  P.R.Tumer,  Level-index  arithmetic:  An 
introductory  survey,  Proc.  Numerical  Analysis  Summer  School,  Lancaster,  1987, 
Springer  Verlag  (1989)  to  appear. 

[9]  C.W.Clenshaw  and  P.R.Tumer,  The  symmetric  level-index  system,  IMA  J Num 
Anal  8 (1988)  517-526. 

[10]  C.W.Clenshaw  and  P.R.Tumer,  Root-squaring  using  level-index  arithmetic,  to 
appear. 

[11]  J.Demmel,  Underflow  and  the  reliability  of  numerical  software,  SIAM  J Sci  Stat 
Comp  5 (1984)  887-919. 

[12]  P.A.Fox,  A.D.Hall  and  N.L.Schryer,  The  PORT  mathematical  subroutine  library, 
ACM  Trans  Math  Software  4 (1978)  104-126. 

[13]  H.Hamada  URR:  Universal  representation  of  real  numbers,  New  Generation 
Computing,  1 (1983)  205-209. 

[14]  U.W.Kulisch  and  W.L.Miranker,  The  arithmetic  of  the  digital  computer:  A new 
approach,  SIAM  Review  28  (1986)  1-40. 

[15]  D.W.Lozier  and  F.W.J.Olver,  Closure  and  precision  in  level-index  arithmetic. 
Manuscript. 

[16]  S.Matsui  and  M.Iri  An  overfiowl underflow  - free  floating-point  representation  of 
numbers,  J.  Information  Proc.  4 (1981)  123-133 

[17]  F.W.J.Olver,  Rounding  errors  in  algebraic  processes  - in  level-index  arithmetic, 
Proc.  Reliable  Numerical  Computation  (M.G.Cox  and  S.Hammarling,  eds.) 


Supercomputers  need  super  arithmetic 


27 


Oxford,  1989,  to  appear. 

[18]  F.W.J.Olver  and  P.R.Tumer,  Implementation  of  level-index  arithmetic  using 
partial  table  look-up,  Proc.  ARITH8,  (M.J.Irwin  and  R.Stefanelli,  Eds.)  EEEE 
Computer  Society,  Washington,  DC,  1987,  144-147. 

[19]  M.J.D.PoweU,  Radial  basis  functions  for  multivariable  interpolation:  A review, 
Algorithms  for  Approximation  143  - 167  (M.G.Cox  and  J.C.Mason,  eds.)  Oxford, 
1987. 

[20]  P.R.Tumer,  Towards  a fast  implementation  of  level-index  arithmetic.  Bull  IMA 
22(1986)  188-191. 

[21]  P.R.Tumer,  A software  implementation  of  sli  arithmetic,  pp.  18-24,  Proc. 
ARITH9,  (Ercegovac  and  Swartzlander,  Eds)  IEEE  Computer  Society, 
Washington  DC,  September  1989. 

[CDC]  Fortran  200  Version  1 Reference  Manual,  60480200,  Rev.  5,  Oct  23, 

1987,  Control  Data  Corporation,  Sunnyvale  CA. 

[Convex]  Convex  Fortran  Language  Reference  Manual,  720-000050-202,  6* 
edition.  Rev  1,  May  1988,  Convex  Computer  Corporation,  Richardson,  TX. 

[Cray]  Fortran  (CFT)  Reference  Manual,  SR-0009,  August  1981,  Cray  Research 

Inc.,  Mendota  Heights,  MN 

[IEEE]  IEEE  Standard  for  Binary  Floating-Point  Arithmetic,  ANSI/IEEE  Std  754- 

1985,  IEEE  Inc.,  New  York,  NY 


U.S.  DEPT.  OF  COMM. 

BIBLIOGRAPHIC  DATA 
SHEET  (See  instructions) 


1.  PUBLICATION  OR 
REPORT  NO. 

NISTIR  89-4135 


2.  Performing  Orgaui.  Report  No. 


3.  Publication  Date 

OCTOBER  1989 


4.  TITLE  AND  SUBTITLE 

Supercomputers  Need  Super  Arithmetic 


5.  AUTHOR(S) 

D.  W.  Lozier  and  P,  R.  Turner 


6.  PERFORMING  ORGANIZATION  (If  joint  or  other  than  NBS,  see  in  struction  s) 

U.S.  OEPAATMENT  OF  COMMERCE 

NATIONAL  INSTITUTE  OF  STANDARDS  AND  TECHNOLOGY 
GAITHERSBURG,  MD  20899 


7.  Contract/Grant  No. 

N/A 


8.  Type  of  Report  & Period  Covered 

Final 


9.  SPONSORING  ORGANIZATION  NAME  AND  COMPLETE  ADDRESS  (Street,  City.  State,  ZIP) 

NIST 

Department  of  Commerce  Mathematics  Department 

Gaithersburg,  MD  20899  U.  S.  Naval  Academy 

Annapolis,  MD  21402 


10.  SUPPLEMENTARY  NOTES 


I I Document  describes  a computer  program;  SF-185,  FIPS  Software  Summary,  is  attached. 


11.  ABSTRACT  (A  200~word  or  less  factual  summary  of  most  significant  information.  If  document  includes  a significant 
bi  bliography  or  literature  survey,  mention  it  here) 

The  title  of  the  paper  is  justified  by  the  consideration  of  the  parallel 
computation  of  vector  norms  and  inner  products  in  floating-point  and  a 
proposed  new  form  of  computer  arithmetic,  the  symmetric  level-index  systemo 


12.  KEY  WORDS  (Six  to  twelve  entries;  alphabetical  order;  capitalize  only  proper  names;  and  separate  key  words  by  semicolons) 

Error  analysis;  floating-point;  parallel  computation;  symmetric  level-index; 
vector  norms;  vectorized  algorithms. 


13.  availability 

Unl  imited 

I I For  Official  Distribution.  Do  Not  Release  to  NTIS 

I I Order  From  Superintendent  of  Documents,  U.S.  Government  Printing  Office,  Washington,  D.C. 
20402. 

[XX  Order  From  National  Technical  Information  Service  (NTIS),  Springfield,  VA.  22161 


14.  NO.  OF 

PRINTED  PAGES 


30 


15.  Price 


AO  3 


USCOMM-DC  6043-P80 


